Now that we have explored basic descriptive statistics, we can delve
a bit deeper. As a final step, we will look at associations and
correlations between two variables.
Scatter Plots
What values does a variable take at different values of another
variable? We can plot the relationship between two variables with the
plot() command to create a scatter plot. Scatter
plots are a very powerful way to get a sense of how two variables are
related. It is almost always a good idea to have a look at a scatter
plot when you first investigate a new variable of interest.
We might think, for instance, that there is an association between
HDI and GDP per capita (gdp_percapita).
Something would actually be wrong if there weren’t, since income per
capita is one of the components of the HDI (the index uses GNI per
capita, which is closely related to GDP per capita).
• Create a scatter plot comparing the HDI and GDP per capita.
plot(qog$human_devt_index, qog$gdp_percapita) # plot(x-values, y-values)
What are some other likely associations – or non-associations – in
our dataset?
plot(qog$human_devt_index, qog$polity)
plot(qog$human_devt_index, qog$access_electricity)
plot(qog$human_devt_index, qog$corruption_perceptions_index)
plot(qog$human_devt_index, qog$fragile_state_index)
Correlation
The associations we see in scatter plots may be strong, weak, or
mixed. Another way to gauge how strongly two variables are related is
to consider the correlation between them. The
correlation coefficient tells us how, on average, two variables move
together relative to their means. The formula for the correlation
coefficient is the following:
\[\rho_{x y}=\frac{1}{n-1}
\sum_{i=1}^n\left(\frac{x_i-\bar{x}}{\sigma_x} \times
\frac{y_i-\bar{y}}{\sigma_y}\right)\]
Pearson’s correlation coefficient ranges from -1 (perfect negative
correlation) through 0 (no linear correlation) and finally to +1
(perfect positive correlation).
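To see how the formula works in practice, the following minimal sketch computes the coefficient step by step and compares it with R's built-in cor(). The vectors x and y are made-up toy values, not taken from the qog dataset, and the sketch assumes that the standard deviations in the formula are the sample standard deviations returned by sd().

```r
# Toy data, made up for illustration (not taken from the qog dataset)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

n <- length(x)

# Standardise each variable: deviation from the mean divided by the
# sample standard deviation (sd() uses the n - 1 denominator)
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Sum the products of the standardised values and divide by n - 1
r_by_hand <- sum(z_x * z_y) / (n - 1)

r_by_hand
cor(x, y)  # built-in Pearson correlation, for comparison
```

Both lines should print the same value, which suggests that cor() implements the formula above.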
If there is missing data, cor() returns NA by default, so we must tell
R not to use those observations in its calculation. We do so with the
additional use = "pairwise.complete.obs" argument, which instructs
R to use only observations where neither of the two variables has a
missing value.
cor(qog$human_devt_index, qog$polity, use="pairwise.complete.obs")
cor(qog$human_devt_index, qog$access_electricity, use="pairwise.complete.obs")
cor(qog$human_devt_index, qog$corruption_perceptions_index, use="pairwise.complete.obs")
cor(qog$human_devt_index, qog$fragile_state_index, use="pairwise.complete.obs")
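The effect of the use argument is easiest to see on a small example. The sketch below uses two made-up vectors (a and b are hypothetical, not variables in qog) and shows that the pairwise option gives the same result as dropping the incomplete pairs by hand.

```r
# Toy vectors with missing values (made up for illustration)
a <- c(1, 2, NA, 4, 5, 6)
b <- c(2, 1, 3, NA, 6, 5)

# Default behaviour: any missing value makes the result NA
r_default <- cor(a, b)

# Use only the pairs where both a and b are observed
r_pairwise <- cor(a, b, use = "pairwise.complete.obs")

# The same result, obtained by dropping incomplete pairs by hand
keep <- !is.na(a) & !is.na(b)
r_manual <- cor(a[keep], b[keep])

r_default
r_pairwise
r_manual
```

Here only four of the six pairs are complete, so the coefficient is based on those four observations.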
Finally, assess the relationship between democracy measured by the
Polity score (polity) and state fragility as measured by
the Fragile States Index (fragile_state_index).
• Is democracy correlated with state fragility?
• Produce a scatter plot that shows the association between democracy
(x-axis) and state fragility (y-axis).
• How do the two variables tend to move together? Add lines to the
plot at the mean of each variable.
• Identify Greece and Italy on the plot.
To answer these questions, let’s start with correlation:
cor(qog$polity, qog$fragile_state_index, use="pairwise.complete.obs")
Let’s do the scatter plot next, adding lines for the variables’ means
- feel free to add colours to highlight the means:
plot(qog$polity, qog$fragile_state_index,
     main="Relationship between Polity score and Fragile States Index",
     xlab="Polity score", ylab="Fragile States Index")
abline(v=mean(qog$polity, na.rm=TRUE))
abline(h=mean(qog$fragile_state_index, na.rm=TRUE))
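The two mean lines split the plot into four quadrants, and this is one way to read how the variables move together: points in the upper-right and lower-left quadrants lie on the same side of both means, while points in the other two quadrants lie on opposite sides. As a rough sketch with made-up toy values (x and y are hypothetical, not qog variables), we can count the points of each kind:

```r
# Toy data, made up for illustration
x <- c(-7, -3, 0, 4, 8, 10)
y <- c(95, 90, 70, 75, 50, 40)

# Deviations of each observation from the variable's mean
dev_x <- x - mean(x)
dev_y <- y - mean(y)

# Positive product: both values above or both below their means
# Negative product: one value above its mean, the other below
same_side     <- sum(dev_x * dev_y > 0)
opposite_side <- sum(dev_x * dev_y < 0)

same_side
opposite_side
cor(x, y)
```

When most points fall in the "opposite side" quadrants, as in this toy example, the correlation is negative.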
You can then identify specific countries - recall how we indexed
values in our last session:
# Italy
print(qog$polity[qog$country=="Italy"])
print(qog$fragile_state_index[qog$country=="Italy"])
# Greece
print(qog$polity[qog$country=="Greece"])
print(qog$fragile_state_index[qog$country=="Greece"])
Using the points() command, you can add these countries
to your scatter plot and adjust colours and point sizes:
# Add them to the plot
points(qog$polity[qog$country=="Italy"],
qog$fragile_state_index[qog$country=="Italy"], col="red", pch=17)
points(qog$polity[qog$country=="Greece"],
qog$fragile_state_index[qog$country=="Greece"], col="red", pch=17)
Now identify a country of your choice!
Let’s conclude by identifying more specific cases - this might come
in handy if you see an interesting data point you would like to learn
more about.
• Use indexing and subsetting to identify the most fragile state with
polity==9.
• Use indexing and subsetting to identify the outlier “non-fragile”
state toward the centre of the Polity score.
#Use indexing and subsetting to identify the most fragile state with `polity==9`
max(qog$fragile_state_index[qog$polity==9], na.rm=TRUE)
qog$country[qog$fragile_state_index==97.3]
qog$country[qog$fragile_state_index == 97.3 & !is.na(qog$fragile_state_index)]
qog$country[qog$fragile_state_index==97.3 & !is.na(qog$fragile_state_index) & qog$polity==9]
We first identify the maximum fragility value among all states with a
Polity score of nine. In principle, however, several observations might
share that value, so we cannot be sure that a lookup on the value alone
returns the state we want. We therefore combine three conditions: a
Polity score of nine, a fragile state index of 97.3 and a non-missing
fragile state index.
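Why missing values matter here can be shown on a tiny example. The sketch below uses a made-up data frame (toy and its column names are hypothetical stand-ins for qog) and also shows which() as one possible alternative to the !is.na() condition.

```r
# Toy data frame standing in for qog (values are made up)
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  polity  = c(9, 9, NA, 5),
  fragile = c(40.1, 97.3, 97.3, NA),
  stringsAsFactors = FALSE
)

# A comparison with NA yields NA, and indexing by NA returns NA,
# so the result contains a missing entry for country "D"
raw <- toy$country[toy$fragile == 97.3]
raw

# which() keeps only the positions where the condition is TRUE,
# so rows with missing values are dropped
hits <- toy$country[which(toy$fragile == 97.3 & toy$polity == 9)]
hits
```

The first lookup returns two countries plus an NA, while the combined condition wrapped in which() returns only the one state that satisfies both criteria.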
We can do the same for our outlier and actually write all this in one
line of code:
# Identify the "non-fragile" state toward the centre of the Polity score.
qog$country[which(qog$polity == -2 & qog$fragile_state_index == min(qog$fragile_state_index[qog$polity == -2], na.rm=TRUE))]
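An alternative that some may find easier to read is to restrict the data to the rows of interest first and then search within that subset. This is only a sketch on a made-up data frame (toy and its columns are hypothetical); note that country "D" has the same fragility value as "B" but a different Polity score, which is why the restriction matters.

```r
# Toy data frame standing in for qog (values are made up)
toy <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  polity  = c(-2, -2, -2, 9, NA),
  fragile = c(80.5, 35.2, NA, 35.2, 60.0),
  stringsAsFactors = FALSE
)

# Step 1: keep only the rows with the Polity score of interest;
# which() drops rows where polity is missing
centre <- toy[which(toy$polity == -2), ]

# Step 2: which.min() returns the position of the smallest
# non-missing value within that subset
least_fragile <- centre$country[which.min(centre$fragile)]
least_fragile
```

The same pattern with which.max() would identify the most fragile state within a subset.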
Let’s finally recreate the scatter plot with country abbreviations as
markers. We first draw an empty plot (hence type="n") and then call the
text() command with three arguments: the x-values, the y-values and the
identifier variable whose values we want to print as labels:
plot(qog$polity, qog$fragile_state_index, type="n",
     main="Relationship between Polity score and Fragile States Index",
     xlab="Polity score", ylab="Fragile States Index",
     cex = 0.5)
text(qog$polity, qog$fragile_state_index, qog$iso3c, cex=0.7)
# More beautifully:
plot(qog$polity, qog$fragile_state_index, type="n",
     main="Relationship between Polity score and Fragile States Index",
     xlab="Polity score", ylab="Fragile States Index",
     cex = 0.5)
text(qog$polity[qog$freedomhouse_status=="Free"], qog$fragile_state_index[qog$freedomhouse_status=="Free"], qog$iso3c[qog$freedomhouse_status=="Free"], cex=0.7, col="blue")
text(qog$polity[qog$freedomhouse_status=="Partly Free"], qog$fragile_state_index[qog$freedomhouse_status=="Partly Free"], qog$iso3c[qog$freedomhouse_status=="Partly Free"], cex=0.7, col="violet")
text(qog$polity[qog$freedomhouse_status=="Not Free"], qog$fragile_state_index[qog$freedomhouse_status=="Not Free"], qog$iso3c[qog$freedomhouse_status=="Not Free"], cex=0.7, col="red")
What does this plot look like?
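The colour-coded labels can also be drawn with a single text() call by looking up each observation's colour in a named vector. The following is a minimal sketch on a made-up data frame; toy, its columns and status_cols are hypothetical names chosen for illustration, not part of qog.

```r
# Toy data frame standing in for qog (values are made up)
toy <- data.frame(
  polity  = c(10, 5, -7, 8),
  fragile = c(30, 70, 95, 45),
  iso3c   = c("AAA", "BBB", "CCC", "DDD"),
  status  = c("Free", "Partly Free", "Not Free", "Free"),
  stringsAsFactors = FALSE
)

# Named vector mapping each status category to a colour
status_cols <- c("Free" = "blue", "Partly Free" = "violet", "Not Free" = "red")

# Look up one colour per observation by its status
label_cols <- unname(status_cols[toy$status])

# Empty plot first, then the coloured labels in one call
plot(toy$polity, toy$fragile, type = "n",
     xlab = "Polity score", ylab = "Fragile States Index")
text(toy$polity, toy$fragile, toy$iso3c, cex = 0.7, col = label_cols)

# Legend that uses the label colours for its text
legend("topright", legend = names(status_cols),
       text.col = unname(status_cols), bty = "n")
```

The lookup keeps the plotting code short, and adding a further category only requires one more entry in status_cols.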